The Three Pillars of the Transformer Architecture
The evolution of large language models marks a paradigm shift: from task-specific models to "unified pre-training," where a single architecture can adapt to many natural language processing needs.
At the core of this shift is the self-attention mechanism, which lets the model weigh the importance of different tokens in a sequence:
$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
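The formula above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration with no masking; the shapes and variable names are assumptions for this example, not a production implementation:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention. Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq_len, seq_len) similarity
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                            # weighted mix of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (5, 8)
```

Each output row is a convex combination of the value vectors, with weights determined by how strongly that query matches each key.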
1. Encoder-only (BERT)
- Mechanism: masked language modeling (MLM).
- Behavior: bidirectional context; the model "sees" the entire sentence at once in order to predict the masked tokens.
- Best suited for: natural language understanding (NLU), sentiment analysis, and named entity recognition (NER).
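The MLM corruption step can be sketched as follows: randomly replace a fraction of tokens with a mask symbol and record the originals as prediction targets, which the model must recover using context from both directions. The 15% rate and token strings below are illustrative assumptions:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Toy MLM corruption: returns (masked_tokens, targets), where targets
    hold the original token at masked positions and None elsewhere."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token...
            targets.append(tok)   # ...and ask the model to predict it
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

masked, targets = mask_tokens("the cat sat on the mat".split())
print(masked)
```

Because the model conditions on both the left and right context of each `[MASK]`, the learned representations are bidirectional, which is exactly what makes them strong for NLU but unsuitable for left-to-right generation.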
2. Decoder-only (GPT)
- Mechanism: autoregressive modeling.
- Behavior: left-to-right processing; the model predicts the next token strictly from the preceding context (causal masking).
- Best suited for: natural language generation (NLG) and creative writing. This is the foundation of modern LLMs such as GPT-4 and Llama 3.
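Causal masking is implemented by zeroing out (setting to negative infinity before the softmax) every attention score that points to a future position. A minimal sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular boolean mask: position i may attend only to j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.zeros((4, 4))               # dummy attention scores
scores[~causal_mask(4)] = -np.inf       # future positions become -inf
print(scores)                           # upper triangle is -inf
```

After the softmax, the `-inf` entries become zero weight, so each token's representation depends only on itself and earlier tokens. This is what makes autoregressive generation (and KV caching) possible.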
3. Encoder-Decoder (T5)
- Mechanism: Text-to-Text Transfer Transformer.
- Behavior: the encoder converts the input string into a dense representation, and the decoder generates the target string from it.
- Best suited for: translation, summarization, and other sequence-to-sequence mapping tasks.
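T5's text-to-text framing reduces every task to string-in, string-out by prepending a task prefix to the input. The sketch below follows that convention; the exact prefix strings here are illustrative, not T5's canonical vocabulary:

```python
def to_text_to_text(task, text):
    """Frame a task in the T5 text-to-text style (prefixes are illustrative)."""
    prefixes = {
        "translate": "translate English to German: ",
        "summarize": "summarize: ",
    }
    return prefixes[task] + text  # every task becomes plain string -> string

print(to_text_to_text("summarize", "The Transformer relies on self-attention."))
```

Because input and output are always text, one model with one loss function can be trained on translation, summarization, and classification alike.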
Key Insight: Decoder Dominance
The industry has largely converged on decoder-only architectures, owing to their superior scaling laws and the emergent reasoning abilities they exhibit in zero-shot settings.
VRAM and the Context Window
In decoder-only models, the KV cache grows linearly with sequence length. A 100k-token context window requires dramatically more VRAM than an 8k window, which makes local deployment of long-context models very challenging without quantization.
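The linear growth is easy to estimate back-of-envelope: the cache stores one key and one value vector per layer, per head, per token. The layer/head counts below are roughly Llama-2-7B-like and fp16 is assumed; all numbers are illustrative:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
    """KV cache size: 2 tensors (K and V) per layer, one vector per head per token."""
    return 2 * n_layers * seq_len * n_heads * head_dim * dtype_bytes

for ctx in (8_000, 100_000):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.1f} GiB")  # ~3.9 GiB vs ~48.8 GiB
```

Under these assumptions the jump from 8k to 100k tokens takes the cache from roughly 4 GiB to nearly 49 GiB, more than a single consumer GPU holds, which is why quantization and cache-compression techniques matter for long contexts.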
Question 1
Why did the industry move from BERT-style encoders to GPT-style decoders for Large Language Models?
Question 2
Which architecture treats every NLP task as a "text-to-text" problem?
Challenge: Architectural Bottlenecks
Analyze deployment constraints based on architecture.
If you are building a model for real-time document summarization where the input is very long, explain why a Decoder-only model might be preferred over an Encoder-Decoder model in modern deployments.
Step 1
Identify the architectural bottleneck regarding context processing.
Solution:
Encoder-Decoders must process the entire long input through the encoder, then perform cross-attention in the decoder, which can be computationally heavy and complex to optimize for extremely long sequences. Decoder-only models process everything uniformly. With modern techniques like FlashAttention and KV Cache optimization, scaling the context window in a Decoder-only model is more streamlined and efficient for real-time generation.
Step 2
Justify the preference using Scaling Laws.
Solution:
Decoder-only models have demonstrated highly predictable performance improvements (Scaling Laws) when increasing parameters and training data. This massive scale unlocks "emergent abilities," allowing a single Decoder-only model to perform zero-shot summarization highly effectively without needing the task-specific fine-tuning often required by smaller Encoder-Decoder setups.